181-2007: Using SAS® PROC NLMIXED for Robust Regression

Authors

  • Steven A. Gilbert
  • Ling Chen
Abstract

We show how the NLMIXED procedure can be used to fit linear models (simple linear regression, multiple linear regression, analysis of variance, and analysis of covariance) when data are not normally distributed or are contaminated with potential outliers. This is accomplished using PROC NLMIXED's ability to fit user-defined distributions through the general log-likelihood option in the MODEL statement. This methodology can offer advantages over M-estimation in PROC ROBUSTREG, which cannot analyze repeated measures, and over PROC NPAR1WAY, which only provides p-values and cannot adjust for covariates or handle repeated measures. We provide some of the basic theory and directions for specifying the likelihoods needed, and then demonstrate the usefulness of the method on several examples.

INTRODUCTION

Linear models are often fitted using the REG, GLM, GENMOD, and MIXED procedures. With the exception of the GENMOD procedure, these procedures assume that the data, Y, can be modeled as Y = mean function + error, where the errors have a normal distribution. When the data consist of independent observations, the procedures use either ordinary least squares (OLS) or, if weights are specified in the procedure call, weighted least squares (WLS). The normality assumption and the presence of outliers can be assessed by examining the residuals from the model. All of these procedures supply raw residuals, calculated as the difference between the observed data value and the estimated mean function: residual = Y - estimated mean function. The raw residuals can then be examined for non-normality and for outliers by graphical means and with PROC UNIVARIATE. It is not uncommon to find that the residuals display one of the following: heavier tails than expected for a normal distribution; skewness; or a small number of larger-than-expected residuals, which may reflect outliers.

In this paper we focus on the presence of outliers and heavy-tailed residual distributions. There is no single agreed-upon definition of an outlier; we will use the intuitive but imprecise definition that an outlier is an observation that does not fit in with the bulk of the data. In a regression model, outlying data points are often identified as observations with residuals that are much larger (in absolute value) than the remaining residuals. Note that residuals depend on the model: apparent outliers can be produced by an incorrect model and can disappear once the incorrect model is replaced with the correct one. Another possibility is that the data have been contaminated in some way. For example, in a clinical trial we may need to measure a subject's blood pressure. There can be measurement error due to equipment malfunction or operator error, or the blood pressure may simply be written down incorrectly on the Case Report Form. It is also possible that one or two subjects decide to go on an espresso and cappuccino binge, artificially raising their blood pressure.

Heavy-tailed distributions are another difficulty that often arises in real-world data. The term "heavy tail" is used in comparison to the relatively "light tail" of a normal distribution. When plotted, the probability density function (pdf) of a heavy-tailed distribution exhibits thicker tails than a normal distribution; that is, the heavy-tailed distribution allows a higher probability of relatively extreme events. Almost all observations generated from a normal distribution will fall within plus or minus three standard deviations of the mean (99.7% of the time, to be precise). Heavy-tailed distributions allow a larger number of observations far from the mean. Such distributions are not uncommon: they may be an intrinsic characteristic of the data, or they may be a byproduct of the analysis method. For an example of the first kind, many types of financial data empirically demonstrate heavy tails (Bouchaud and Potter, 2000). An example where the analyst creates a heavy-tailed distribution can be found in clinical trials. Many clinicians like to look at percent changes, 100 × (post-baseline - baseline) / baseline. If some baseline values are close to zero, extremely large percent changes (either positive or negative) can result. This is problematic enough, but if the measured variable can only take on positive values, the percent change variable can also be highly skewed.
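As background for the residual checks described above, the sketch below shows one common way to obtain raw residuals from an ordinary least squares fit and inspect them for non-normality and outliers with PROC UNIVARIATE. The data set name trial and the variables y and x are placeholders introduced for illustration; they are not from the paper.

   /* Hypothetical data set TRIAL with response Y and covariate X.          */
   /* Fit ordinary least squares and write the raw residuals to a data set. */
   proc reg data=trial;
      model y = x;
      output out=resids r=raw_resid;
   run;

   /* Examine the raw residuals for non-normality and outliers. */
   proc univariate data=resids normal plot;
      var raw_resid;
   run;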
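The central device of the paper is the general log-likelihood form of the MODEL statement in PROC NLMIXED. The following is a minimal sketch of that idea, not the authors' actual code: it fits a simple linear regression whose errors follow a scaled t distribution, a heavy-tailed alternative to the normal. The data set trial, the variables y and x, the starting values in the PARMS statement, and the fixed three degrees of freedom are all assumptions made for illustration.

   /* Robust simple linear regression via a user-defined log likelihood. */
   proc nlmixed data=trial;
      parms b0=0 b1=0 s=1;            /* intercept, slope, and scale     */
      df = 3;                         /* heavier-than-normal tails       */
      mu = b0 + b1*x;                 /* mean function                   */
      /* log density of a scaled t distribution evaluated at y */
      ll = lgamma((df+1)/2) - lgamma(df/2)
           - 0.5*log(df*constant('PI')) - log(s)
           - ((df+1)/2)*log(1 + ((y - mu)/s)**2/df);
      model y ~ general(ll);
   run;

Swapping in a different log density changes the degree of robustness; for instance, replacing ll with -log(2*s) - abs(y - mu)/s (a Laplace, or double-exponential, error model) gives a fit in the spirit of least absolute deviations.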

Similar articles

179-2007: Using PROC NLMIXED and PROC GLMMIX to Analyze Dyadic Data with a Dichotomous Dependent Variable

In the social and health sciences, data are often hierarchical (subjects nested in groups). One kind of hierarchy is the dyad, or couple, where each group consists of two subjects. Dyadic data pose particular problems for statistical analysis for several reasons: First, variation may occur at the individual or dyadic level. Second, the data are not independent. Third, the small group size poses...


Use of the Probability Integral Transformation to Fit Nonlinear Mixed-Effects Models With Nonnormal Random Effects

This article describes a simple computational method for obtaining the maximum likelihood estimates (MLE) in nonlinear mixed-effects models when the random effects are assumed to have a nonnormal distribution. Many computer programs for fitting nonlinear mixed-effects models, such as PROC NLMIXED in SAS, require that the random effects have a normal distribution. However, there is often interes...


The Comparative Analysis of Predictive Models for Credit Limit Utilization Rate with SAS/STAT®

Credit card usage modelling is a relatively innovative task of client predictive analytics compared to risk modelling such as credit scoring. The credit limit utilization rate is a problem with limited outcome values and highly dependent on customer behavior. Proportion prediction techniques are widely used for Loss Given Default estimation in credit risk modelling (Belotti and Crook, 2009; Ars...


Analysis of Survival Data with Clustered Events

Two methods for analyzing survival data with clustered events are presented. The first method is a proportional hazards model which adopts a marginal approach with a working independence assumption. This model can be fitted by SAS PROC PHREG with the robust sandwich estimate option. The second method is a likelihood-based random effects (frailty) model. In the second model, the baseline hazard c...


Multiple Ways to Detect Differential Item Functioning in SAS

Differential item functioning (DIF), as an assessment tool, has been widely used in quantitative psychology, educational measurement, business management, and insurance and healthcare industries. The purpose of DIF analyses is to detect response differences of items in questionnaires, rating scales, or tests across different subgroups (e.g., gender), while controlling for ability level. There a...



Journal title:

Volume:   Issue:

Pages:  -

Publication date: 2007